evidence source
Learning to Seek Evidence: A Verifiable Reasoning Agent with Causal Faithfulness Analysis
Huang, Yuhang, Lin, Zekai, Zhong, Fan, Liu, Lei
Explanations for AI models in high-stakes domains like medicine often lack verifiability, which can hinder trust. To address this, we propose an interactive agent that produces explanations through an auditable sequence of actions. The agent learns a policy to strategically seek external visual evidence to support its diagnostic reasoning. This policy is optimized using reinforcement learning, resulting in a model that is both efficient and generalizable. Our experiments show that this action-based reasoning process significantly improves calibrated accuracy, reducing the Brier score by 18% compared to a non-interactive baseline. To validate the faithfulness of the agent's explanations, we introduce a causal intervention method. By masking the visual evidence the agent chooses to use, we observe a measurable degradation in its performance ($\Delta$Brier = +0.029), confirming that the evidence is integral to its decision-making process. Our work provides a practical framework for building AI systems with verifiable and faithful reasoning capabilities.
- Health & Medicine > Diagnostic Medicine > Imaging (1.00)
- Health & Medicine > Nuclear Medicine (0.94)
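The Brier score and the masking comparison mentioned in the abstract above can be illustrated with a minimal sketch. The probabilities and labels below are made up for illustration and are not the paper's data; `brier_score` is a hypothetical helper implementing the standard binary definition (mean squared difference between predicted probability and outcome), not the authors' code.

```python
def brier_score(probs, labels):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(labels) or not probs:
        raise ValueError("probs and labels must be non-empty and equally long")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


# Hypothetical predictions for four cases (assumed values, not from the paper).
labels = [1, 0, 1, 0]
with_evidence = [0.9, 0.2, 0.8, 0.1]    # agent sees the evidence it selected
evidence_masked = [0.7, 0.4, 0.6, 0.3]  # same cases with that evidence masked

# A positive difference means masking made the predictions worse (lower is better).
delta_brier = brier_score(evidence_masked, labels) - brier_score(with_evidence, labels)
print(f"with evidence: {brier_score(with_evidence, labels):.3f}")
print(f"masked:        {brier_score(evidence_masked, labels):.3f}")
print(f"delta:         {delta_brier:+.3f}")
```

A positive delta in this kind of comparison is what the abstract reports as evidence that the selected regions matter to the decision.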
MedFact: A Large-scale Chinese Dataset for Evidence-based Medical Fact-checking of LLM Responses
Chen, Tong, Wang, Zimu, Miao, Yiyi, Luo, Haoran, Sun, Yuanfei, Wang, Wei, Jiang, Zhengyong, Sen, Procheta, Su, Jionglong
Medical fact-checking has become increasingly critical as more individuals seek medical information online. However, existing datasets predominantly focus on human-generated content, leaving the verification of content generated by large language models (LLMs) relatively unexplored. To address this gap, we introduce MedFact, the first evidence-based Chinese medical fact-checking dataset of LLM-generated medical content. It consists of 1,321 questions and 7,409 claims, mirroring the complexities of real-world medical scenarios. We conduct comprehensive experiments in both in-context learning (ICL) and fine-tuning settings, showcasing the capability and challenges of current LLMs on this task, accompanied by an in-depth error analysis to point out key directions for future research. Our dataset is publicly available at https://github.com/AshleyChenNLP/MedFact.
- Europe > Austria > Vienna (0.14)
- Asia > China > Shaanxi Province > Xi'an (0.04)
- North America > United States > New Mexico > Bernalillo County > Albuquerque (0.04)
Expediting data extraction using a large language model (LLM) and scoping review protocol: a methodological study within a complex scoping review
Stewart-Evans, James, Wilson, Emma, Langley, Tessa, Prayle, Andrew, Hands, Angela, Exley, Karen, Leonardi-Bee, Jo
The data extraction stages of reviews are resource-intensive, and researchers may seek to expedite data extraction using online large language models (LLMs) and review protocols. Claude 3.5 Sonnet was used to trial two approaches that used a review protocol to prompt data extraction from 10 evidence sources included in a case study scoping review. A protocol-based approach was also used to review extracted data. A limited performance evaluation found high accuracy for the two extraction approaches (83.3% and 100%) when extracting simple, well-defined citation details; accuracy was lower (9.6% and 15.8%) when extracting more complex, subjective data items. Considering all data items, both approaches had precision >90% but low recall (<25%) and F1 scores (<40%). The context of a complex scoping review, open response types and methodological approach likely impacted performance due to missed and misattributed data. LLM feedback considered the baseline extraction accurate and suggested minor amendments: 4 of 15 (26.7%) to citation details and 8 of 38 (21.1%) to key findings data items were considered to potentially add value. However, when repeating the process with a dataset featuring deliberate errors, only 2 of 39 (5%) errors were detected. Review-protocol-based methods used for expediency require more robust performance evaluation across a range of LLMs and review contexts with comparison to conventional prompt engineering approaches. We recommend researchers evaluate and report LLM performance if using them similarly to conduct data extraction or review extracted data. LLM feedback contributed to protocol adaptation and may assist future review protocol drafting.
- Europe > United Kingdom > England > Nottinghamshire > Nottingham (0.14)
- Asia > Middle East > Jordan (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.67)
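The combination of high precision with low recall and low F1 reported in the abstract above follows directly from how the three metrics are defined. The sketch below uses assumed counts chosen only to reproduce that pattern; they are not taken from the study, and `precision_recall_f1` is a hypothetical helper name.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Assumed counts: most extracted items are correct (few false positives),
# but many items that should have been extracted were missed (many false negatives).
p, r, f1 = precision_recall_f1(tp=20, fp=2, fn=80)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, missed items pull it down sharply even when nearly everything that was extracted is correct.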
Introducing Answered with Evidence -- a framework for evaluating whether LLM responses to biomedical questions are founded in evidence
Baldwin, Julian D, Dinh, Christina, Mukerji, Arjun, Sanghavi, Neil, Gombar, Saurabh
The growing use of large language models (LLMs) for biomedical question answering raises concerns about the accuracy and evidentiary support of their responses. To address this, we present Answered with Evidence, a framework for evaluating whether LLM-generated answers are grounded in scientific literature. We analyzed thousands of physician-submitted questions using a comparative pipeline that included: (1) Alexandria, formerly known as the Atropos Evidence Library, a retrieval-augmented generation (RAG) system based on novel observational studies, and (2) two PubMed-based retrieval-augmented systems (System and Perplexity). We found that PubMed-based systems provided evidence-supported answers for approximately 44% of questions, while the novel evidence source did so for about 50%. Combined, these sources enabled reliable answers to over 70% of biomedical queries. As LLMs become increasingly capable of summarizing scientific content, maximizing their value will require systems that can accurately retrieve both published and custom-generated evidence, or generate such evidence in real time.
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (0.93)
Evaluating Evidential Reliability In Pattern Recognition Based On Intuitionistic Fuzzy Sets
Xu, Juntao, Zhan, Tianxiang, Deng, Yong
Determining the reliability of evidence sources is a crucial topic in Dempster-Shafer theory (DST). Previous approaches have addressed high conflicts between evidence sources using discounting methods, but these methods may not ensure the high efficiency of classification models. In this paper, we consider the combination of DST and Intuitionistic Fuzzy Sets (IFS) and propose an algorithm for quantifying the reliability of evidence sources, called Fuzzy Reliability Index (FRI). The FRI algorithm is based on decision quantification rules derived from IFS, defining the contribution of different basic probability assignments (BPAs) to correct decisions and deriving the evidential reliability from these contributions. The proposed method effectively enhances the rationality of reliability estimation for evidence sources, making it particularly suitable for classification decision problems in complex scenarios. Subsequent comparisons with DST-based algorithms and classical machine learning algorithms demonstrate the superiority and generalizability of the FRI algorithm. The FRI algorithm provides a new perspective for future decision probability conversion and reliability analysis of evidence sources.
- Asia > China > Sichuan Province > Chengdu (0.04)
- North America > United States (0.04)
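For context on the discounting methods the abstract above refers to, here is a minimal sketch of classical Shafer discounting followed by Dempster's rule of combination. It is not the FRI algorithm, whose details the abstract does not give. The frame, mass values, reliability factor, and the function names `discount` and `dempster_combine` are all assumptions made for illustration.

```python
from itertools import product


def discount(masses, reliability, frame):
    """Shafer discounting: scale every mass by `reliability` and
    move the remaining 1 - reliability onto the whole frame."""
    out = {focal: reliability * m for focal, m in masses.items()}
    out[frame] = out.get(frame, 0.0) + (1.0 - reliability)
    return out


def dempster_combine(m1, m2):
    """Dempster's rule: multiply masses of intersecting focal sets,
    then renormalise by 1 - K, where K is the conflicting mass."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {focal: m / (1.0 - conflict) for focal, m in combined.items()}


# Hypothetical two-class frame and two partly conflicting sources.
A, B = frozenset({"a"}), frozenset({"b"})
FRAME = A | B
source1 = {A: 0.8, FRAME: 0.2}
source2 = {B: 0.6, FRAME: 0.4}

fused = dempster_combine(source1, source2)
# Treat source2 as only 50% reliable (assumed value) before combining.
fused_discounted = dempster_combine(source1, discount(source2, 0.5, FRAME))
print(fused[A], fused_discounted[A])
```

In this toy case, discounting the second source shifts the fused belief further toward class `a`, which is the effect a reliability estimate is meant to control.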
Learning Improved Representations by Transferring Incomplete Evidence Across Heterogeneous Tasks
Davvetas, Athanasios, Klampanos, Iraklis A.
Acquiring ground truth labels for unlabelled data can be a costly procedure, since it often requires manual labour that is error-prone. Consequently, the available amount of labelled data is limited by the constraints of manual data labelling. It is possible to increase the amount of labelled data samples by performing automated labelling or crowd-sourcing the annotation procedure. However, these approaches often introduce noise or uncertainty into the labelset, which leads to decreased performance of supervised deep learning methods. On the other hand, weak supervision methods remain robust to noisy labelsets or can be effective even with small amounts of labelled data. In this paper we evaluate the effectiveness of a representation learning method that uses external categorical evidence, called "Evidence Transfer", when only a small amount of corresponding evidence, termed incomplete evidence, is available. Evidence transfer is a robust solution against external unknown categorical evidence that can introduce noise or uncertainty. In our experimental evaluation, evidence transfer proves to be effective and robust against different levels of incompleteness, for two types of incomplete evidence.
- North America > United States > Wisconsin > Dane County > Madison (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > Greece > Attica > Athens (0.04)
- Health & Medicine (0.46)
- Information Technology (0.46)
Multi-Source Fusion Operations in Subjective Logic
van der Heijden, Rens Wouter, Kopp, Henning, Kargl, Frank
The purpose of multi-source fusion is to combine information from more than two evidence sources, or subjective opinions from multiple actors. For subjective logic, a number of different fusion operators have been proposed, each matching a fusion scenario with different assumptions. However, not all of these operators are associative, and therefore multi-source fusion is not well-defined for these settings. In this paper, we address this challenge, and define multi-source fusion for weighted belief fusion (WBF) and consensus & compromise fusion (CCF). For WBF, we show the definition to be equivalent to the intuitive formulation under the bijective mapping between subjective logic and Dirichlet evidence PDFs. For CCF, since there is no independent generalization, we show that the resulting multi-source fusion produces valid opinions, and explain why our generalization is sound. For completeness, we also provide corrections to previous results for averaging and cumulative belief fusion (ABF and CBF), as well as belief constraint fusion (BCF), which is an extension of Dempster's rule. With our generalizations of fusion operators, fusing information from multiple sources is now well-defined for all different fusion types defined in subjective logic. This enables wider applicability of subjective logic in applications where multiple actors interact.
- Europe > Switzerland (0.04)
- Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.04)